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Abstract 

The sparse group lasso optimization problem is solved using a coordinate 
gradient descent algorithm. The algorithm is applicable to a broad class of 
convex loss functions. Convergence of the algorithm is established, and the 
algorithm is used to investigate the performance of the multinomial sparse 
group lasso classifier. On three different real data examples the multino- 
mial group lasso clearly outperforms multinomial lasso in terms of achieved 
classification error rate and in terms of including fewer features for the clas- 
sification. The run-time of our sparse group lasso implementation is of the 
same order of magnitude as the multinomial lasso algorithm implemented in 
the R package glmnet. Our implementation scales well with the problem size. 
One of the high dimensional examples considered is a 50 class classification 
problem with 10k features, which amounts to estimating 500k parameters. 
The implementation is available as the R package msgl. 

Keywords: Sparse group lasso, classification, high dimensional data 
analysis, coordinate gradient descent, penalized loss. 



1. Introduction 

The sparse group lasso is a regularization method that combines the lasso 
[1] and the group lasso [2]. Friedman et al. [3] proposed a coordinate descent 
approach for the sparse group lasso optimization problem. Simon et al. [I] 
used a generalized gradient descent algorithm for the sparse group lasso and 
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considered applications of this method for linear, logistic and Cox regression. 
We present a sparse group lasso algorithm suitable for high dimensional prob- 
lems. This algorithm is applicable to a broad class of convex loss functions. 
In the algorithm we combine three non-differentiable optimization methods: 
the coordinate gradient descent [SJ, the block coordinate descent [0] and a 
modified coordinate descent method. 



Our main application is to multinomial regression. In Section 1.1 



we 

introduce the general sparse group lasso optimization problem with a convex 
loss function. Part |T] investigates the performance of the multinomial sparse 
group lasso classifier. In Part [TT] we present the general sparse group lasso 
algorithm and establish convergence. 

The formulation of an efficient and robust sparse group lasso algorithm is 
not straight forward due to non-differentiability of the penalty. Firstly, the 
sparse group lasso penalty is not completely separable, which is problematic 
when using a standard coordinate descent scheme. To obtain a robust algo- 
rithm an adjustment is necessary. Our solution is a minor modification of 
the coordinate descent algorithm; it efficiently treats the singularity at zero 
that cannot be separated out. Secondly, our algorithm is a Newton type al- 
gorithm; hence we sequentially optimize penalized quadratic approximations 
of the loss function. This approach raises another challenge: how to reduce 
the costs of computing the Hessian? In Section 4.6 we show that an upper 
bound on the Hessian is sufficient to determine whether the minimum over a 
block of coefficients is attained at zero. This approach enables us to update 
a large percentage of the blocks without computing the complete Hessian. 
In this way we reduce the run-time, provided that the upper bound of the 
Hessian can be computed efficiently We found that this approach reduces 
the run-time on large data sets by a factor of more than 2. 

Our focus is on applications of the multinomial sparse group lasso to 
problems with many classes. For this purpose we choose three multiclass 
classification problems. We found that the multinomial group lasso and 
sparse group lasso perform well on these problems. The error rates were 
substantially lower than the best obtained with multinomial lasso, and the 
low error rates were achieved for models with fewer features having non-zero 
coefficients. For example, we consider a text classification problem consisting 
of Amazon reviews with 50 classes and 10k textual features. This problem 
showed a large improvement in the error rates: from approximately 40% for 
the lasso to less than 20% for the group lasso. 

We provide a generic implementation of the sparse group lasso algorithm 
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in the form of a C++ template library. The implementation for multinomial 
and logistic sparse group lasso regression is available as an R package. For 
our implementation the time to compute the sparse group lasso solution is of 
the same order of magnitude as the time required for the multinomial lasso 
algorithm as implemented in the R-package glmnet. The computation time 
of our implementation scales well with the problem size. 

1.1. Sparse group lasso 

Consider a convex, bounded below and twice continuously differentiable 
function / : R n — > R. We say that $ G R n is a sparse group lasso minimizer 
if it is a solution to the unconstrained convex optimization problem 

minimize / + A$ (1) 

where $ : lR n — > R is the sparse group lasso penalty (defined below) and 
A > 0. 

Before defining the sparse group lasso penalty some notation is needed. 
We decompose the search space 

R n = R"i x • • • x R rim 

into m G N blocks having dimensions rii G N for i — 1, . . . , m, hence n = 
ni + ■ ■ ■ + n m . For a vector (3 G R n we write (3 = (/3^, . . . , (3^) where 
/?W el" 1 ,..., f3^ G R Hm . For J — 1, ... ,m we call f3^ the J'th block of 
f3. We use the notation @\ J ^ to denote the €th coordinate of the J'th block 
of (3, whereas (3i is the z'th coordinate of j3. 

Definition 1 (Sparse group lasso penalty). The sparse group lasso penalty 
is defined as 

m n 

$(/?) = f (l -a)J2lJ ||/3 (J) || 2 + IAI 
j=i i=i 

for a G [0,1], group weights 7 G [0, oo) m , and parameter weights £ = 
G [0,oo) n w/jere£« G [0, oo) ni , . . . , C {m) e [0,oo) nm . 

The sparse group lasso penalty includes the lasso penalty (a = 1) and 
the group lasso penalty (a = 0). Note also that for sufficiently large values 
of A the minimizer of (JIJ is zero. The infimum of these, denoted A max , is 
computable, see Section |4~2| 

We emphasize that the sparse group lasso penalty is specified by 
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• a grouping of the parameters (3 = (/3^\ . . . , (3^) 

• and the weights a, 7 and £. 

In Part H] we consider multinomial regression; here the parameter grouping 
is given by the multinomial model. For multinomial as well as other regression 
models, the grouping can also reflect a grouping of the features. 



Part I 

The multinomial sparse group 
lasso classifier 

In this section we examine the characteristics of the multinomial sparse group 
lasso method. Our main interest is the application of the multinomial sparse 
group lasso classifier to problems with many classes. For this purpose we 
have chosen three classification problems based on three different data sets, 
with 10, 18 and 50 classes. In [7] the micro RNA expression profile of different 
types of primary cancer samples is studied. In Section 3.1| we consider the 



problem of classifying the primary site based on the microRNA profiles in this 
data set. The Amazon reviews author classification problem, presented in 



[8], is studied in Section 3.2 The messenger RNA profile of different human 



muscle diseases is studied in [9]. We consider, in Section 3.3, the problem of 
classifying the disease based on the messenger RNA profiles in this data set. 
Table Q] summarizes the dimensions and characteristics of the data sets and 
the associated classification problems. Finally, in Section |4| we examine the 
characteristics of the method applied to simulated data sets. 



2. Setup 

Consider a classification problem with K classes, N samples, and p fea- 
tures. Assume given a data set (x^, yi), . . . , (xjv> Vn) where, for all i = 
1, . . . , N, Xi G W is the observed feature vector and yi G {1, . . . , K} is 
the categorical response. We organize the feature vectors in the N xp design 
matrix 

v dcf f \ T 

X = [Xi- ■■X N ) . 
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As in [TU] we use a symmetric parametrization of the multinomial model. 
With /i:{l,...,i(}xRMI given by 

E fc =i ex p(%) 

the multinomial model is specified by 

P{y l = l\x i ) = h{l,(3^ + (3x i ). 

The model parameters are organized in the i^-dimensional vector, /3^°\ of 
intercept parameters together with the K x p matrix 

/3^(pW...pV)), (2) 

where (3^ G M x are the parameters associated with the z'th feature. The 
lack of identifiability in this parametrization is in practice dealt with by the 
penalization. 

The log-likelihood is 

TV 

£(^°\p) = J2^gh(y t ,P^ + (3x l ). (3) 

i=l 

Our interest is the sparse group lasso penalized maximum likelihood estima- 
tor. That is, (f3(°\ (3) is estimated as a minimizer of the sparse group lasso 
penalized negative-log-likelihood: 

/ p Kp \ 

- £(^°\ 0) + A (1 - a) £ 7j ¥ J \ + IAI • (4) 

V J=l i=l / 



In our applications we let = \>K for all J = 1, . . . ,p and = 1 for all 
i = 1, . . . ,Kp, but other choices are possible in the implementation. Note 
that the parameter grouping, as part of the penalty specification, is given in 
terms of the columns in (|2j), i.e. m = p. 

3. Data examples 

The data sets were preprocessed before applying the multinomial sparse 
group lasso estimator. Two preprocessing schemes were used: normalization 
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Data set Features K N p 

Cancer sites microRNA expressions 18 162 217 

Amazon reviews Various textual features 50 1500 10k 

Muscle diseases Gene expression 10 107 22k 

Table 1: Summary of data sets and the associated classification problem. 

and standardization. Normalization entails centering and scaling of the sam- 
ples in order to obtain a design matrix with row means of and row variance 
of 1. Standardization involves centering and scaling the features to obtain a 
design matrix with column means of and column variance of 1. Note that 
the order in which normalization and standardization are applied matters. 

The purpose of normalization is to remove technical (non-biological) vari- 
ation. A range of different normalization procedures exist for biological data. 
Sample centering and scaling is one of the simpler procedures. We use this 
simple normalization procedure for the two biological data sets in this paper. 
Normalization is done before and independent from the sparse group lasso 
algorithm. 

The purpose of standardization is to create a common scale for features. 
This ensures that differences in scale will not influence the penalty and thus 
the variable selection. Standardization is an option for the sparse group lasso 
implementation, and it is applied as the last preprocessing step for all three 
example data sets. 

We want to compare the performance of the multinomial sparse group 
lasso estimator for different values of the regularization parameter a. Ap- 
plying the multinomial sparse group lasso estimator with a given a G [0,1] 
and A-sequence, Ai, . . . , A^ > 0, results in a sequence of estimated models 
with parameters {/3(A;, a)}^^...^. The generalization error can be estimated 
by cross validation [TTj. For our applications we keep the sample ratio be- 
tween classes in the cross validation subsets approximately fixed to that of the 
entire data set. Hence, we may compute a sequence, {Err(Aj, a)}i=i,...,d, of es- 
timated expected generalization errors for the sequence of models. However, 
for given a>i and a 2 we cannot simply compare Err(Aj, a{) and Err(Aj,a 2 ), 
since the Aj value is scaled differently for different values of a. We will in- 
stead compare the models with the same number of non-zero parameters and 
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the same number of non-zero parameter groups, respectively. Define 



e(A,«) d = f J]/(/3( J )(A,a)^0) 



clef 

J=l 



with /3(A, a) the estimator of (5 for the given values of A and a. That is, 
0(A,a) is the number of non-zero parameter blocks in the fitted model. 
Note that there is a one-to-one correspondence between parameter blocks 
and features in the design matrix. Furthermore, we define the total number 
of non-zero parameters as 



n(A,a) d = f X>(A(A,«)^0). 



i=l 



In particular, we want to compare the fitted models with the same num- 
ber of parameter blocks. There may, however, be more than one A-value 
corresponding to a given value of 6. Thus we compare the models on a 
subsequence of the A-sequence. With 9i < ■ ■ ■ < 9$ for d' < d denoting the 
different elements of the set {6(Aj, a)}^^...^ in increasing order we define 



Aj(a)= f min|A 6(A,a) = 6'j| 



We then compare the characteristics of the multinomial sparse group lasso 
estimators for different a values by comparing the estimates 



{(Err(A,(a),«),e(A,(a)),fl(A,(«)) 



-l,...,d' 



3.1. Cancer sites 

The data set consists of bead-based expression data for 217 microRNAs 
from normal and cancer tissue samples. The samples are divided into 11 
normal classes, 16 tumor classes and 8 tumor cell line classes. For the pur- 
pose of this study we select the normal and tumor classes with more than 
5 samples. This results in an 18 class data set with 162 samples. The data 
set is unbalanced, with the number of samples in each class ranging from 5 
to 26 and with an average of 9 samples per class. Data was normalized and 
then standardized before running the sparse group lasso algorithm. For more 
information about this data set see [7]. The data set is available from the 
Gene Expression Omnibus with accession number GSE2564. 
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Figure 1: Estimated expected generalization error, for different values of a, for the mi- 
croRNA cancer sites data set. The cross validation based estimate of the expected misclas- 
sification error is plotted against the number of non-zero parameter blocks in the model 
(left), and against the number of non-zero parameters in the model (right). The estimated 
standard error is approximately 0.03 for all models. 



Figure [I] shows the result of a 10-fold cross validation for 5 different values 
of a, including the lasso and group lasso. The A-sequence runs from A max 
to 10~ 4 , with d = 200. It is evident that the group lasso and sparse group 
lasso models achieve a lower expected error using fewer genes than the lasso 
model. However, models with a low a value have a larger number of non-zero 
parameters than models with a high a value. A reasonable compromise could 
be the model with a = 0.25. This model does not only have a low estimated 
expected error, but the low error is also achieved with a lower estimated 
number of non-zero parameters, compared to group lasso. 

3.2. Amazon reviews 

The Amazon review data set consists of 10k textual features (including 
lexical, syntactic, idiosyncratic and content features) extracted from 1500 
customer reviews from the Amazon Commerce Website. The reviews were 
collected among the reviews from 50 authors with 50 reviews per author. 
The primary classification task is to identify the author based on the textual 
features. The data and feature set were presented in [S] and can be found in 
the UCI machine learning repository [T2] . In [8] a Synergetic Neural Network 
is used for author classification, and a 2k feature based 10-fold CV accuracy 
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Figure 2: Estimated expected generalization error, for different values of a, for the Amazon 
reviews author classification problem. The cross validation based estimate of expected 
misclassification error is plotted against the number of non-zero parameter blocks in the 
model (left), and against the number of non-zero parameters in the model (right). The 
estimated standard error is approximately 0.01 for all models. 

of 0.805 is reported. The feature selection and training of the classifier were 
done separately. 

We did 10-fold cross validation using multinomial sparse group lasso for 
five different values of a. The results are shown in Figure [2] The A-sequence 
runs from A max to 10~ 4 , with d = 100. The design matrix is sparse for 
this data set. Our implementation of the multinomial sparse group lasso 
algorithm utilizes the sparse design matrix to gain speed and for memory 
efficiency No normalization was applied for this data set. Features were 
scaled to have variance 1, but were not centered. 

For this data set it is evident that lasso performs badly and that the 
group lasso performs best - in fact much better than lasso. The group lasso 
achieves an accuracy of around 0.82 with a feature set of size ~ lk. This 
outperforms the neural network in [8]. 

3.3. Muscle diseases 

This data set consists of messenger RNA array expression data of 119 
muscle biopsies from patients with various muscle diseases. The samples are 
divided into 13 diagnostic groups. For this study we only consider classes 
with more than 5 samples. This results in a classification problem with 107 
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Figure 3: Estimated expected generalization error, for different values of a, for the muscle 
disease classification problem. The cross validation based estimate of expected rnisclas- 
sification error is plotted against the number of non-zero parameter blocks in the model 
(left), and against the number of non-zero parameters in the model (right) The estimated 
standard error is approximately 0.04 for all models. 



samples and 10 classes. The data set is unbalanced with class sizes ranging 
from 4 to 20 samples per class. Data was normalized and then standardized 
before running the sparse group lasso algorithm. For background information 
on this data set, see j9]. The data set is available from the Gene Expression 
Omnibus with accession number GDS1956. 

The results of a 10-fold cross validation are shown in Figure [3] The A- 
sequence runs from A max to 10~ 5 , with d = 200. We see the same trend as 
in the other two data examples. Again the group lasso models perform well, 
however not significantly better than the closest sparse group lasso models 
(a = 0.25). The lasso models perform reasonably well on this data set, but 
they are still outperformed by the sparse group lasso models. 

4. A simulation study 

In this section we investigate the characteristics of the sparse group lasso 
estimator on simulated data sets. We are primarily interested in trends in 
the generalization error as a is varied and A is selected by cross validation 
on a relatively small training set. We suspect that this trend will depend 
on the distribution of the data. We restrict our attention to multiclass data 
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where the distribution of the features given the class is Gaussian. Loosely 
speaking, we suspect that if the differences in the data distributions are very 
sparse, i.e. the centers of the Gaussian distributions are mostly identical 
across classes, the lasso will produce models with the lowest generalization 
error. If the data distribution is sparse, but not very sparse, then the optimal 
a is in the interval (0, 1). For a dense distribution, typically with differences 
being among all classes, we expect the group lasso to perform best. The 
simulation study confirms this. 

The mathematical formulation is as follows. Let 

where ^ G W for i = 1, . . . , K and p = p a + p b - Denote by a data 
set consisting of ./V samples for each of the K classes - each sampled from 
the Gaussian distribution with centers /ii, . . . respectively, and with a 
common covariance matrix E. Let A be the smallest A- value with the minimal 
estimated expected generalization error, as determined by cross validation on 
V^. Denote by Err M (A, a) the generalization error of the model /3(A, a) that 
has been estimated from the training set V^, by the sparse group lasso, for 
the given values of A and a. Then let 

Z^a) = Err A1 (A, a) - Err Baye s(/i) 

where ErrB a yes(/-0 is the Bayes rate. We are interested in trends in Z M , as a 
function of a, for different configurations of fj,i, . . . ,/ik- To be specific, we 
will sample fj,i, . . . , /Ik from one of the following distributions: 

• A sparse model distribution, where the first p a entries of are i.i.d. 
with a distribution that is a mixture of the uniform distribution on 
[—2, 2] and the degenerate distribution at with point probability p . 

• A dense model distribution, where the first p a entries of /ij are i.i.d. 
Laplace distributed with location and scale b. 

The last p b entries are zero. We take p a = [5/(1 — p )\ throughout for 
the sparse model distribution. The within class covariance matrix E is con- 
structed using features from the cancer site data set. Let Eo be the empirical 
covariance matrix of p randomly chosen features. To avoid that the covari- 
ance matrix become singular we take 

E = (1 - <$)Eo + <5I 
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Figure 4: The estimated expected error gab (solid black line) for the three configurations. 
The central 95% of the distribution of Z^a) is shown as the shaded area on the plot. The 
error gab for 5 randomly selected /i-configurations is shown (red dashed lines). 



for 6 G (0,1). 

The primary quantity of interest is 



err(a) = f E (Z,. 



:<*)) 



(5) 



the expectation being over \x and the data set V^. We are also interested in 
how well we can estimate the non-zero patterns of the /Vs. Consider this 
as Kp two class classification problems, one for each parameter, where we 
predict the ^ to be non-zero if /3y is non-zero, and fiij to be zero otherwise. 
We calculate the number of false positives, true positives, false negatives and 
true negatives. The positive predictive value (ppv) and the true positive 
rate (tpr) are of particular interest. The true positive rate measures how 
sensitive a given method is at discovering non-zero entries. The positive 
predictive value measures the precision with which the method is selecting 
the non-zero entries. We consider the following two quantities 



tpr(a 



def 



E 



tpr [f3{\,a] 



and ppv(a) = f E 



ppv lp(\,a) 



(6) 



In order to estimate the quantities ^ and ^ we sample M configurations 
of n from one of the above distributions. For each configuration we sample 
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a training and a test data set of sizes NK and lOOif, respectively. Using 
the training data set we fit the model /3(A, a) and estimate Z^ia) using the 
test data set. Estimates err(a), tpr(a) and ppv(a) are the corresponding 
averages over the M configurations. 

For this study we chose M = 100, N = 15, K = 25, p b = 50, 5 = 0.25 
and the following three configuration distributions: 

• Thin configurations, where the centers are distributed according to the 
sparse model distribution with p = 0.95, as defined above. 

• Sparse configurations, where the centers are distributed according to 
the sparse model distribution with p = 0.80. 

• Dense configurations, where the centers are distributed according to 
the dense model distribution with scale b = 0.2 and p a = 25. 

In Figure [4] we see that for thin configurations the lasso has the lowest 
estimated error gab, along with the sparse group lasso with a = 0.8. For the 
sparse configurations the results indicate that the optimal choice of a is in 
the open interval (0, 1), but in this case all choices of a result in a comparable 
error gab. For the dense configurations the group lasso is among the methods 
with the lowest error gab. 

In Figure [5] we plotted the true positive rate for the three configurations. 
Except for the thin configurations, the lasso is markedly less sensitive than 
the sparse group and group lasso methods. However, looking at Figure [6] we 
see that the sparse group and group lasso methods have a lower precision 
than the lasso, except for the dense configurations. We note that the group 
lasso has the worst precision, except for the dense configurations. 

Part II 

The sparse group lasso 
algorithm 

In this section we present the sparse group lasso algorithm. The algorithm 
is applicable to a broad class of loss functions. Specifically, we require that 
the loss function / : IR n — » R is convex, twice continuously differentiable and 
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Figure 5: The estimated expected true positive rate (solid black line) for the three con- 
figurations. The central 95% of the distribution of tpr is shown as the shaded area on the 
plot. The true positive rate for 5 randomly selected /i-configurations is shown (red dashed 
lines). 




Figure 6: The estimated expected positive predictive value (solid black line) for the three 
configurations. The central 95% of the distribution of ppv is shown as the shaded area on 
the plot. The positive predictive value for 5 randomly selected /i-configurations is shown 
(red dashed lines). 
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bounded below. Additionally, we require that all quadratic approximations 
around a point in the sublevel set 



{0 E R n | /(/?) + A$(/3) < f(p ) + A$(/3 ) } 

are bounded below, where /3q G R n is the initial point. The last requirement 
will ensure that all subproblems are well defined. 

The algorithm solves Q for a decreasing sequence of A values ranging 
from A max to a user specified A min . The algorithm consists of four nested 
main loops: 

• A numerical continuation loop, decreasing A. 

• An outer coordinate gradient descent loop (Algorithm [T]) . 

• A middle block coordinate descent loop (Algorithm [2]) . 

• An inner modified coordinate descent loop (Algorithm [3]) . 



In Section 4^3 to 4J3 we discuss the outer, middle and inner loop, respec- 
tively. In Section we develop a method allowing us to bypass compu- 
tations of large parts of the Hessian, hereby improving the performance of 
the middle loop. Section [5] provides a discussion of the available software 
solutions, as well as run-time performance of the current implementation. 

The theoretical basis of the optimization methods we apply can be found 



in [61 [5] . A short review tailored to this paper is given in Appendix A 



4-1. The sparse group lasso penalty 

In this section we derive fundamental results regarding the sparse group 
lasso penalty. 

We first observe that $ is separable in the sense that if, for any group 
J G 1, ... ,771, we define the convex penalty $ (J) : R nj -> R by 



$( J )(x) d = f (l-a) 7 j||x|| 2 + «^e 



(J) 



1 1 



i=l 



then $(/3) = Y^j=i ^ J \^ JS> )- Separability of the penalty is required to 
ensure convergence of coordinate descent methods, see E], and see also 



Appendix A 
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In a block coordinate descent scheme the primary minimization problem 
is solved by minimizing each block, one at a time, until convergence. We 
consider conditions ensuring that 

G argmin g(x) + A$ (J) (x) (7) 

x£R n J 

for a given convex and twice continuously different iable function g : R nj — > R. 
For J = 1, . . . ,m a straightforward calculation shows that the subgradient 
of $( J ) at zero is 

«9$( J )(0) = (1 - a)^jB nj + adiag(£ (J) )T n/ 

where B n = {xGl" ||x|| 2 < 1}, T n = [-1,1]" and where for x G R n 
diag(x) denotes the nx n diagonal matrix with diagonal x. For an introduc- 
tion to the theory of subgradients see Chapter 4 in [T3] . 

Proposition [l] below gives a necessary and sufficient condition for ^ to 
hold. Before we state the proposition the following definition is needed. 

Definition 2. For n G N we define the map k : W 1 x W 1 — y W 1 by 

def JO \Zi\ <Vi 

for i — 1 , . . . , n 

Zi — sgn(^)f i otherwise 
and the function K : R n x R n R by 

K(v,z) = \\k(v,z)\\1 = ^2 - sgn(zi)ui) 2 . 

{i | \zi\>Vi } 

Proposition 1. Assume given a > 0, v , z G lR n and define the closed sets 

Y = z + diag(»T n and X = aB n + Y. 
Then the following hold: 

a. k(v,z) = argmin ||y|| 2 . 

b. G X if and only if K(v, z) < a 2 . 

c. If K(v, z) > a 2 then argmin ||x|| 2 = ( 1 — a/ a/ K(v, z) J k(v, z). 
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The proof of Proposition [T] is given in Appendix C Proposition [T] implies 
that ([7]) holds if and only if 



K(Aa£( J \Vs(0))<A(l-a)7j. 

The following observations will prove to be valuable. Note that we use ^ 
to denote coordinatewise ordering. 

Lemma 1. For any three vectors v,z,z' € 1R™ the following hold: 

a. K(v, z) = K(v, \z\). 

b. K(v,z) < K(v,z') when \z\ ^ \z'\. 

Proof, (a) is a simple calculation and (b) is a consequence of the definition 
and (a). □ 

4-2. The \-sequence 

For sufficiently large A values the only solution to ([!]) will be zero. We 
denote the infimum of these by A max . By using the above observations it is 
clear that 



Amax = f inf <! A > 



/3(A) = 



inf <^ A > 



VJ = 1, . . . , m : ^K(Xa^ J \ V/(0) (J) ) < A(l - a) 7j 



max inf < A > 

J=l,...,m 



yjK(\a£V\Vf(0) (J) ) < A(l - a) 7 j 



It is possible to compute 



inf < A > 



^K(\ae j Wf(0) {J) ) < A(l - ahj 



by using the fact that the function A — > K(Xa^ J \ V/(0)^ J ' ) ) is piecewise 
quadratic and monotone. 
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4-3. Outer loop 

In the outer loop a coordinate gradient descent scheme is used. In this 
paper we use the simplest form of this scheme. In this simple form the 
coordinate gradient descent method is similar to Newton's method; however 
the important difference is the way the non-differentiable penalty is handled. 
The convergence of the coordinate gradient descent method is not trivial and 
is established in [5]. 

The algorithm is based on a quadratic approximation of the loss function 
/, at the current estimate of the minimizer. The difference, A, between the 
minimizer of the penalized quadratic approximation and the current estimate 
is then a descent direction. A new estimate of the minimizer of the objective 
is found by applying a line search in the direction of A. We repeat this 
until a stopping condition is met, see Algorithm [TJ Note that a line search is 
necessary in order to ensure global convergence. For most iterations, however, 
a step size of 1 will give sufficient decrease in the objective. With q = 
V/(/5) and H = V 2 /(/9) the quadratic approximation of / around the current 
estimate, (3, is 

q T (x-P) + ^(x-f3) T H(x-(3) 

= q T x - q T p + X -x T Ex - l - ((3 T Hx + x T H$) + ^(3 T H{3. 

H is symmetric, thus it follows that the quadratic approximation of / around 
(3 equals 

Q(x)-q T (3 + ^ T H(3, 
where Q : 1R 71 — > R is defined by 

Q(x) d ^ f ( q - H(3) T x + \x T Hx. 

We have reduced problem ([T]) to the following penalized quadratic optimiza- 
tion problem 

mmQ(x) + A$(x). (8) 

The convergence of Algorithm [T] is implied by Theorem le in [5] . This 
implies: 

Proposition 2. Every cluster point of the sequence {(3k}ken generated by 
AlgorithmUjis a solution of problem [w. 
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Algorithm 1 Outer loop. Solve (TTj) by coordinate gradient descent. 



Require: (3 = (3q 
repeat 

Let q = V/(/5), H = V 2 f(/3) and Q(x) = (q - H/3) T x + \x T Hx. 
Compute (3 = argmin(5(x) + A$(x). 

:reIR n 

Compute step size t and set = (3 + tA, for A = (3 — j3. 
until stopping condition is met. 



Remark 1. The convergence of Algorithm^ is ensured even if H is a (sym- 
metric) positive definite matrix approximating V 2 /(/3). For high dimensional 
problems it might be computationally beneficial to take H to be diagonal, e.g. 
as the diagonal ofV 2 f((3). 

4-4- Middle loop 

In the middle loop the penalized quadratic optimization problem ^ is 
solved. The penalty $ is block separable, i.e. 



Q{x) + \<5>{x) =Q( X ) + \Y,<5> 



{j)( x (jy 



J=0 



with $( J ) convex, and we can therefore use the block coordinate descent 
method over the blocks x^ 1 ', . . . , x^ m \ The block coordinate descent method 
will converge to a minimizer even for non-differentiable objectives if the non- 
differentiable parts are block separable, see [5J. Since $ is separable and 
Q is convex, twice continuously differentiable and bounded below, the block 
coordinate descent scheme converges to the minimizer of problem ([8]). Hence, 
our problem is reduced to the following collection of problems, one for each 
J — 1, ... ,m, 

min Q (J) (x) + A$ (J) (x) (9) 

x£R n J 

where : W 1 - 1 — > R is the quadratic function 

*Xj 7 \ , * J 1 ' ' ' 1 ? 5 1 ' ' ' 1 ) 
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H 12 ■ ■ 






H22 • ■ 


H2m 


\H m i 


H m 2 ■ ■ 


H m m J 



up to an additive constant. We decompose an n x n matrix H into block 
matrices in the following way 



H 



where Hjj is an rij x nj matrix. By the symmetry of H it follows that 
Q {J \x) = x T {q - #/3) {J) + ^ 2 E * Th JiX (I) ~ x T H JjX ^ + x T Hjj^j 

= x T (g( J ) + [H(x - (3)] (J) - Hjjx^) + \x T H JjX 
up to an additive constant. We may, therefore, redefine 

Q^{x)^x T g^+ l -x T Hjjx 
where the block gradient is defined by 
9 



W^qW + [H{x-P)]W-HjjxW. 



(10) 



For the collection of problems given by (J9l) a considerable fraction of the 
minimizers will be zero in practice. By Lemma [T] this is the case if and only 
if 

^K(Xae j \g {J) ) < A(l - a) 7 j. 
These considerations lead us to Algorithm [2] 

4-5. Inner loop 

Finally we need to determine the minimizer of @, i.e. the minimizer of 



>i j 



Q(J)(Z) + X(l- a )yj\\x\\ 2 + aJ2C. 



(J) 



X ; 



(11) 



i=Q 



loss 



penalty 



The two first terms of (11) are considered the loss function and the last 
term is the penalty. Note that the loss is not differentiable at zero (due to 



20 



Algorithm 2 Middle loop. Solve ( 



8 



) by block coordinate descent. 



repeat 

Choose next block index J according to the cyclic rule. 

Compute the block gradient g( J \ 

if ^/KiXa^^g^) < A(l - a) 7j then 

Let = 0. 
else 

Let x^ = argminQ^x) + \& J ^(x). 
end if 

until stopping condition is met. 



the L 2 -norm), thus we cannot completely separate out the non-differentiable 
parts. This implies that ordinary block coordinate descent is not guaranteed 
to converge to a minimizer. Algorithm [3] adjusts for this problem, and we 
have the following proposition. 

Proposition 3. For any e > the cluster points of the sequence {xk}k£N 
generated by Algorithm^ are minimizers of |77] ). 

Proof. Since Q^ J \0) + A$^(0) = Algorithm [3] is a modified block coordi- 
nate descent scheme. Furthermore J is chosen such that (11) is not optimal 
at 0. We can therefore apply Lemma [4] in Appendix B from which the claim 
follows directly. □ 

Hence, for a given block J = 1, . . . , m we need to solve the following two 
problems: 

I. For each j — 1, . . . nj, compute a minimizer for the function 

to =s £ _i. n( J )/v( J ) r,.^) /t. rf,( J ) ^( J h 



II. Compute a descent direction at zero for (11). 

Regarding I. Writing out the equation we see that in the j'th iteration we 
need to find the minimizer of the function u : IR. — > M. given by 

uj(x) = cx + -jhx 2 + ^yy/x 2 + r + £ \x\ (12) 
21 



with c = g) + ,(//././ ),,./',. 7 = A(l - a)j J: £ = XaQ >, r = T,i^j x h 
and where h is the j'th diagonal of the Hessian block Hjj. 

By convexity of / we conclude that h > 0. Lemma [2] below deals with 
the case h > 0. Since the quadratic approximation Q is bounded below the 
case h = implies that c = 0, hence for /i = we have x = 0. 

Lemma 2. If h > then the minimizer x of u is given as follows: 

a. If r = or 7 = then 

»/ c> ^ + 7 
»/ |c| < £ + 7 
*/ c < -£ - 7 



5. 7/r > 0,7 > then x = if \c\ < £ and otherwise the solution to 

x 



c + sgn(£ — c)£ + /ix + 7- 



Vx 2 + r 

Proof. Simple calculations will show the results. 



0. 



□ 



For case (b) in the above lemma we solve the equation by applying a 
standard root finding method. 

Regarding II. For a convex function / : IR n — > R and a point a; £ M. n , the 
vector 



A 



-argmm \\x\\ 2 

xedf(x) 



is a descent direction at x provided / is not optimal at x, see [13], Section 
8.4. We may use this fact to compute a descent direction at zero for the 
function (11). By Proposition [l] it follows that A £ W 1 defined by 



A, ^ - 







g\ JS> — \a^\ J> sgn((^ JJ ) otherwise 



is a descent direction at zero for the function (11). 
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Algorithm 3 Inner loop. Compute the minimizer of (11) by a modified 

coordinate descent scheme. 

repeat 

Choose next parameter index j according to the cyclic rule. 
Compute 

(J) • n(J)/ {J) (J) ~ (J) (J)s 

x) = argmm Q l >(x\ \ . . . , x)J u x, x)^, . . .,x K n J) 



if ||x (J) || 2 < e and Q^ J \x^) + \& J \x^) > then 
Compute a descent direction, A, at zero for (11). 
Use line search to find t such that Q iJ) (tA) + A$ (J) (tA) < 0. 
Let x (J) = tA 

end if 

until stopping condition is met. 



4-6. Hessian upper bound optimization 

In this section we present a way of reducing 
which the block gradient needs to be computed, 
computational costs of the algorithm. 

In the middle loop, Algorithm |5J the block g 
for all m blocks. To determine if a block is zero 
compute an upper bound on the block gradient, 
already computed we focus on the term involving 
J = 1, . . . , m, we compute a bj G E such that 



the number of blocks for 
The aim is to reduce the 



radient (10) is computed 
it is, in fact, sufficient to 
Since the gradient, q, is 
the Hessian. That is, for 



[H(x-P)] (J) 



d bjD nj 



where D n = (1, 1, . . . , 1) £ W 1 . We define 



tj = f sup <( x > 



Aj(A< (J) , M J) I + xD nj ) < A(l - a) 7 j 



when a/Aj(Ao;^ j ), |(?^ J ^) < A(l — a)7j and otherwise let ij = 0. When 
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bj < tj it follows by Lemma [T] that 

Kj(Xae J \g (J) ) = Kj(Xae J) ,\g iJ) \) 

<Kj(Xa^ J \\q {J) \+bjD n} ) 

<A 2 (1-«)M 

and by Proposition [T] this implies that the block J is zero. The above con- 
siderations lead us to Algorithm |4j Note that it is possible to compute the 
tj's by using the fact that the function 

1914 Kj(Xa^ J \ \q {J) \+xD nj ) 

is monotone and piecewise quadratic. 

Algorithm 4 Middle loop with Hessian bound optimization, 
repeat 

Choose next block index J according to the cyclic rule. 
Compute upper bound bj. 
if bj < tj then 
Let x (J) = 0. 
else 

Compute g( J ^ and compute new (see Algorithm pjj. 
end if 

until stopping condition is met. 



In Algorithm [4] it is only necessary to compute the block gradient for 
those blocks where x^ J > ^ or when bj < tj. This is only beneficial if we can 
efficiently compute a sufficiently good bound bj. For a broad class of loss 
functions this can be done using the Cauchy-Schwarz inequality. 

To assess the performance of the Hessian bound scheme we used our 
multinomial sparse group lasso implementation with and without bound op- 
timization (and with a = 0.5). Table[2]lists the ratio of the run-time without 
using bound optimization to the run-time with bound optimization, on the 
three different data sets. The Hessian bound scheme decreases the run-time 
for the multinomial loss function, and the ratio increases with the number of 
blocks m in the data set. The same trend can be seen for other values of a. 
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Data set n m Ratio 



Cancer 3.9k 217 
Amazon 500k 10k 
Muscle 220k 22k 



1.14 
1.76 
2.47 



Tabic 2: Timing the Hessian bound optimization scheme. 



5. Software 

We provide two software solutions in relation to the current paper. An 
R package, msgl, with a relatively simple interface to our multinomial and 
logistic sparse group lasso regression routines. In addition, a C++ template 
library, sgl, is provided. The sgl template library gives access to the generic 
sparse group lasso routines. The R package relies on this library. The sgl 
template library relies on several external libraries. We use the Armadillo C++ 
library [H] as our primary linear algebra engine. Armadillo is a C++ template 
library using expression template techniques to optimize the performance of 
matrix expressions. Furthermore we utilize several Boost libraries [T5] . Boost 
is a collection of free peer-reviewed C++ libraries, many of which are template 
libraries. For an introduction to these libraries see for example [16J. Use of 
multiple processors for cross validation and subsampling is supported through 
OpenMP [17] . 

The msgl R package is available from CRAN. The sgl library is available 
upon request. 

5.1. Run-time performance 

Table [3] lists run-times of the current multinomial sparse group lasso im- 
plementation for three real data examples. For comparison, the glmnet uses 
5.2s, 8.3s and 137.0s, respectively, to fit the lasso path for the three data sets 
in Table [3j The glmnet is a fast implementation of the coordinate descent 
algorithm for fitting generalized linear models with the lasso penalty or the 
elastic net penalty [TO] ■ The glmnet cannot be used to fit models with group 
lasso or sparse group lasso penalty. 

6. Conclusion 

We developed an algorithm for solving the sparse group lasso optimiza- 
tion problem with a general convex loss function. Furthermore, convergence 
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T Sparse group lasso „ , 

Lasso _ „ „_ Group lasso 

a = 0.75 a = 0.25 F 



Data set n m 

Cancer 3.9k 217 5.9s 

Muscle 220k 22k 25.0s 

Amazon 500k 10k 331.6s 



4.8s 6.3s 6.0s 

25.8s 37.7s 36.7s 
246.7s 480.4s 285.1s 



Table 3: Times for computing the multinomial sparse group lasso regression solutions for 
a lambda sequence of length 100, on a 2.20 GHz Intel Core i7 processor (using one thread). 
In all cases the sequence runs from A max to 0.002. The number of samples in the data sets 
Cancer, Muscle and Amazon are respectively 162, 107 and 1500. See also Table [T] and the 
discussions in Sections 3.1 3.3 and |3.2| respectively. 



of the algorithm was established in a general framework. This framework in- 
cludes the sparse group lasso penalized negative-log-likelihood for the multi- 
nomial distribution, which is of primary interest for multiclass classification 
problems. 

We implemented the algorithm as a C++ template library. An R pack- 
age is available for the multinomial and the logistic regression loss functions. 
We presented applications to multiclass classification problems using three 
real data examples. The multinomial group lasso solution achieved optimal 
performance in all three examples in terms of estimated expected misclas- 
sification error. In one example some sparse group lasso solutions achieved 
comparable performance based on fewer features. If there is cost associated 
with the acquisition of each feature, the objective would be to minimize the 
cost while optimizing the classification performance. In general, the sparse 
group lasso solutions provide sparser solutions than the group lasso. These 
sparser solutions can be of interest for model selection purposes and for in- 
terpretation of the model. 

Appendix A. Block coordinate descent methods 

In this section we review the theoretical basis of the optimization methods 
that we apply in the sparse group lasso algorithm. We use three slightly 
different methods: a coordinate gradient descent, a block coordinate descent 
and a modified block coordinate descent. 

We are interested in unconstrained optimization problems on IR n where 
the coordinates are naturally divided into m G N blocks with dimensions 
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rii & N for i = 1, . . . , m. We decompose the search space 

E n = M ni x • • • x R nm 

and denote by Pj the orthogonal projection onto the i'th block. For a vector 
x E W 1 we write x = (x {1 \ x {m) ) where el" 1 ,,.., x {m) G R nm . For 
% = 1, . . . , m we call the i'th block of x. We assume that the objective 

function F : IR n — > K is bounded below and of the form 

in 

F(x)=f(x)+Y,hi(x®) 
i=i 

where / : lR n — > K is convex and each /ij : IR ni — >■ K, for z = 1, . . . ,m 
are convex. Furthermore, we assume that for any % = l,...,m and any 
x = (xo^\ . . . , Xq-™}) the function 

l n '9i^ F{x {1 \ Xo (i_1) , x, x (4+1) , • • • , x (m) ) 

is hemivariate. A function is said to be hemivariate if it is not constant on 
any line segment of its domain. 

Appendix A.l. Coordinate gradient descent 

Algorithm 5 Coordinate gradient descent scheme, 
repeat 

Compute quadratic approximation Q of f around the current point x. 
Compute search direction 

m 

x new = argminQ(x) + V hi . 

x£R n ■ ; 

Let A = x — x new and compute step size t using the Armijo rule and let 
X i — x -\- tA. 
until stopping condition is met. 



For this scheme we make the additional assumption that / is twice con- 
tinuously differentiable everywhere. The scheme is outlined in Algorithm [5j 
where the step size is chosen by the Armijo rule outlined in Algorithm |6j 
Theorem le in [5] implies the following: 

Corollary 1. If f is twice continuously differentiable then every cluster point 
of the sequence {xk}ken generated by Algorithm^ is a minimizer of F . 
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Algorithm 6 Armijo rule. 
Require: a E (0, 0.5) and b E (0, 1) 

Let 5 = V/(x) r A + YZi (Hm + A<) - K^)). 

while F(x + tA) > F(x) + tab do 
t <- bt. 

end while 



Appendix A. 2. Block coordinate descent 



Algorithm 7 Block coordinate descent, 
repeat 

Choose next block index % according to the cyclic rule. 
<— argminF(x © Pt x )- 

until some stopping condition is met. 



The block coordinate descent scheme is outlined in Algorithm [7j By 
Corollary [2] below the block coordinate descent method converges to a coor- 
dinatewise minimum. 

Definition 3. A point p E M. n is said to be a coordinatewise minimizer of F 
if for each block i — 1, . . . ,m it holds that 

F(p + (0, . . . , 0, d h 0, . . . , 0)) > F(p) for all <U E R Hi . 

If / is differentiable then by Lemma [3] the block coordinate descent 
method converges to a minimizer. Lemma [3] below is a simple consequence 
of the separability of F. 

Lemma 3. Let p EW 1 be a coordinatewise minimizer of F. If f is differen- 
tiable at p then p is a stationary point of F. 

Proposition 5.1 in [6] implies the following: 

Corollary 2. For the sequence {xk}km generated by the block coordinate 
descent algorithm (Algorithm^ it holds that every cluster point of {xk}k<=N 
is a coordinatewise minimizer of F. 
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Algorithm 8 Modified coordinate descent loop, 
repeat 

Let i i + 1 mod m. 

-f- argminF(a; © Pt x )- 

if \\x — p\\ 2 < e and F(x) > F{p) then 
Compute descent direction A at p for F. 
Use line search to find t such that F(p + tA) < F(p). 
Let x® <-p + tA. 

end if 

until stopping condition is met. 



Appendix B. Modified block coordinate descent 

For this last scheme we make the additional assumption that / is twice 
continuously differentiable everywhere except at a given non-optimal point 
p G W 1 . In this case the block coordinate descent method is no longer guar- 
anteed to be globally convergent, as it may get stuck at p. One immediate 
solution to this is to compute a descent direction at p, then use a line search 
to find a starting point Xq with F(xo) < F(p). Since / is differentiable on 
the sublevel set {x G M. n \ F(x) < F{p) } it follows by the results above that 
the cluster points of the generated sequence are stationary points of F. This 
procedure is not efficient since it discards a carefully chosen starting point. 
We apply the modified coordinate descent loop, outlined in Algorithm |8j 
instead. 

Lemma 4. Assume that f is differentiable everywhere except at p G W 1 , and 
that F is not optimal at p. Then for any e > the cluster points of the 
sequence {x k }keN generated by Algorithm^ are minimizers of F. 

Proof. Let z be a cluster point of {x k }. By Corollary [2J z is a coordinatewise 
minimizer of F. Then Lemma [3] implies that z is either p or a stationary point 
of F. We shall show by contradiction that p is not a cluster point of {a^j^N, 
thus assume otherwise. The sequence {F(x k )}k£N is decreasing; hence, if 
we can find a k' G N such that F(x k ) < F(p) we reach a contradiction 
(since this would conflict with the continuity of F). Choose k' such that 
\\x k — p\\ 2 < e- Since we may assume that F(x k ) > F(p) it follows by the 
definition of Algorithm [1] that F(x k ' +1 ) < F(p). □ 
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Appendix C. Proof of Proposition [T] 

(a) Straightforward. 

(b) If ||/c(u, z)|| 2 < a then —k(v,z) G aB n hence G X. For the other 
implication simply choose yo G Y such that — yo G aB n and note that 
\\k\v,z)\\ 2 < ||j/ || 2 < a. 

(c) Assume \\k(v, z)\\ 2 > a, and let x* — (1 — a/ ||/«(t>, z) || 2 )/t(t>, z). Then 
a;* G X and ||a:*|| 2 = \\k(v, z)\\ 2 — a. The point x* is in fact a minimizer. To 
see this let x' G X, that is we have 

x' = z + as + diag(f )t 

for some s G B n and t G T n . It follows, by the triangle inequality and (a), 
that 

\\x'\\ 2 + a > \\x' — as\\ 2 = \\z + diag(t>)i|| 2 > \\k(v , z)\\ 2 . 

So ||x'|| 2 > ||k(u,2:)|| 2 — a = \\x*\\ 2 and since X is convex and x — > \\x\\ 2 is 
strictly convex the found minimizer x* is the unique minimizer. 
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